Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated chunking_document. #65

Merged
merged 1 commit into from
Jul 6, 2024
Merged

Updated chunking_document. #65

merged 1 commit into from
Jul 6, 2024

Conversation

PalmPalm7
Copy link
Contributor

@PalmPalm7 PalmPalm7 commented Jul 2, 2024

Demonstrated new chunking methods in replace of RecursiveCharacterTextSplitter()

Updates:

  1. Applied document-specific test splitter from Langchain in replace of original naive version.
  2. Made heuristics changes to markdown file, especially using regex to trim markdown tables in attempt to fit in the whole table with limited context window.
  3. For updated chunk_document() function, see Chunking_Demo.ipynb on chunking with server_ctx_size=4096, chunk_word_count=1024). Granite 7b has 4k context windows.

Saved for later:

  1. "Soft dependency" freeze and introducing Magika should wait.

For further PR, an efficient library to detect file type could be added for document-specific chunking and heuristics.

  1. How do you handle chunks that are larger than the provided server_ctx_size ?

No logic on larger chunk size updated this PR. "CharacterTextSplitter will only split on separator (which is '\n\n' by default). chunk_size is the maximum chunk size that will be split if splitting is possible." See this StackOverflow Post.

  1. Is it assumed that if the file is not any of code file it always markdown?

My Justification is there is I assume our most common use cases, as we discussed, is PDF into markdown formats, therefore default case should be markdown. Furthermore, by specifying the language param in RecursiveCharacterTextSplitter, it uses these following separators:

RecursiveCharacterTextSplitter.get_separators_for_language(Language.MARKDOWN)

Output:
['\n#{1,6} ',
 '```\n',
 '\n\\*\\*\\*+\n',
 '\n---+\n',
 '\n___+\n',
 '\n\n',
 '\n',
 ' ',
 '']

Instead of these separators.

["\n\n", "\n", " "]

Original PR, see: #45

@abhi1092
Copy link
Member

abhi1092 commented Jul 2, 2024

Tested this PR. Chunking seems to be working as expected. @PalmPalm7 as discussed can you add the handling of document if its not passed as list?

@PalmPalm7
Copy link
Contributor Author

Tested this PR. Chunking seems to be working as expected. @PalmPalm7 as discussed can you add the handling of document if its not passed as list?

Updated chunking method with this logic. Ready to merge @abhi1092 .

Copy link
Member

@russellb russellb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please squash your 3 commits into 1 and add all of the description you have on the PR into the commit message? Let me know if you need any assistance doing this!

@PalmPalm7
Copy link
Contributor Author

PalmPalm7 commented Jul 2, 2024

Can you please squash your 3 commits into 1 and add all of the description you have on the PR into the commit message? Let me know if you need any assistance doing this!

Thank you Russell! I have squashed and amended my commits. Ready to merge @russellb .

Copy link
Member

@russellb russellb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good -- passing e2e CI, and I like that it falls back to the old method in case of an error.

Instead of a raw print() I would prefer using a logger. Can you add a new parameter for logger, update consumers to pass in a logger, and then use that for the messages you're printing? Thanks!

src/instructlab/sdg/utils/chunking.py Outdated Show resolved Hide resolved
@PalmPalm7
Copy link
Contributor Author

looks good -- passing e2e CI, and I like that it falls back to the old method in case of an error.

Instead of a raw print() I would prefer using a logger. Can you add a new parameter for logger, update consumers to pass in a logger, and then use that for the messages you're printing? Thanks!

Done! @russellb

I've added this parameter:
logger = logging.getLogger(__name__) and replaced all existing print functions.

Copy link
Member

@russellb russellb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, nice work!

(you could drop "resolved merge conflict" from the PR title / commit message, but it's not a big deal, and not worth the CI reset probably)

Copy link
Member

@RobotSail RobotSail left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments but LGTM overall, great work!

@PalmPalm7 PalmPalm7 changed the title Updated chunking_document. Resolved merge conflict. Updated chunking_document. Jul 2, 2024
@markmc
Copy link
Contributor

markmc commented Jul 3, 2024

Updates:

  1. Used a efficient library to detect file type.

This new dependency is coming at a time when we are in a "soft dependency" freeze - i.e. we are avoiding adding new dependencies, unless there is some exceptional reason to justify it

If we did not add this library now (i.e. waited until the next milestone release has been completed), I guess we would just use language=Language.MARKDOWN and have no intelligent support for code files (go, java, etc.), latex, and html?

IMO changing from the current naive list of separators to language=Language.MARKDOWN is a good change to make now, but adding the new dependency can wait since support for those other file formats isn't a major priority right now

@russellb
Copy link
Member

russellb commented Jul 3, 2024

Updates:

  1. Used a efficient library to detect file type.

This new dependency is coming at a time when we are in a "soft dependency" freeze - i.e. we are avoiding adding new dependencies, unless there is some exceptional reason to justify it

If we did not add this library now (i.e. waited until the next milestone release has been completed), I guess we would just use language=Language.MARKDOWN and have no intelligent support for code files (go, java, etc.), latex, and html?

IMO changing from the current naive list of separators to language=Language.MARKDOWN is a good change to make now, but adding the new dependency can wait since support for those other file formats isn't a major priority right now

That makes a ton of sense. Thanks for thinking this through, Mark.

Markdown is the only knowledge doc format supported via the CLI right now, so it's only format we need to support at the moment.

Copy link
Member

@russellb russellb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing review based on Mark's feedback. I think he's right about the dependency and what features are required right now vs. can come later.

@markmc
Copy link
Contributor

markmc commented Jul 3, 2024

@PalmPalm7 if you split the magika part into a separate PR, we could merge the rest as a super useful improvement to the markdown support 👍

@dhellmann
Copy link

@PalmPalm7 if you split the magika part into a separate PR, we could merge the rest as a super useful improvement to the markdown support 👍

+1 -- magika introduces a dependency on onnxruntime, which is another package that doesn't publish standard source code packages so we'll have to figure out how to build it to prepare a release downstream. It would be nice if we could wait to bring that in until this first major release is settled.

@PalmPalm7
Copy link
Contributor Author

Thank you for taking the time to review and foresee potential risks @dhellmann @markmc @russellb , I'll split it into to two PR.

@PalmPalm7 PalmPalm7 force-pushed the main branch 2 times, most recently from ad106f3 to 19bf54b Compare July 3, 2024 15:27
@PalmPalm7
Copy link
Contributor Author

Updates:

  1. Used a efficient library to detect file type.

This new dependency is coming at a time when we are in a "soft dependency" freeze - i.e. we are avoiding adding new dependencies, unless there is some exceptional reason to justify it
If we did not add this library now (i.e. waited until the next milestone release has been completed), I guess we would just use language=Language.MARKDOWN and have no intelligent support for code files (go, java, etc.), latex, and html?
IMO changing from the current naive list of separators to language=Language.MARKDOWN is a good change to make now, but adding the new dependency can wait since support for those other file formats isn't a major priority right now

That makes a ton of sense. Thanks for thinking this through, Mark.

Markdown is the only knowledge doc format supported via the CLI right now, so it's only format we need to support at the moment.

Updated logic according to Mark and Doug's suggestions above. Ready to merge @russellb .

1. Applied document-specific test splitter from Langchain in replace of original naive version.
2. Made heuristics changes to markdown file, especially using regex to trim markdown tables in attempt to fit in the whole table with limited context window.
3. For updated chunk_document() function, see Chunking_Demo.ipynb on chunking with server_ctx_size=4096, chunk_word_count=1024). Granite 7b has 4k context windows.

Signed-off-by: Andy Xie <[email protected]>
@dhellmann
Copy link

Thank you for taking the time to review and foresee potential risks @dhellmann @markmc @russellb , I'll split it into to two PR.

Thank you for being flexible!

@oindrillac oindrillac requested a review from russellb July 5, 2024 17:24
@oindrillac
Copy link
Contributor

@russellb can you please check if the changes addressed your concern above?

)

# Determine file type for heuristics, default with markdown
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this comment may be out of date, but not a big deal

@russellb russellb merged commit 6842582 into instructlab:main Jul 6, 2024
11 checks passed
@russellb russellb added this to the 0.1.0 milestone Jul 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants